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Abstract 

Assigning bond orders is a necessary and essential step for characterizing a chemical structure correctly in force 
field based simulations. Several methods have been developed to do this. They all have advantages but with 
limitations too. Here, an automatic algorithm for assigning chemical connectivity and bond order regardless of 
hydrogen for organic molecules is provided, and only three dimensional coordinates and element identities are 
needed for our algorithm. The algorithm uses hard rules, length rules and conjugation rules to fix the structures. 
The hard rules determine bond orders based on the basic chemical rules; the length rules determine bond order by 
the length between two atoms based on a set of predefined values for different bond types; the conjugation rules 
determine bond orders by using the length information derived from the previous rule, the bond angles and some 
small structural patterns. The algorithm is extensively evaluated in three datasets, and achieves good accuracy of 
predictions for all the datasets. Finally, the limitation and future improvement of the algorithm are discussed. 
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Background 

The advances in protein cloning, expression, labeling, 
purification, protein crystallization and structure deter- 
mination has resulted in a rapid increase in protein 
structures having been determined [1]. Nearly 45 years 
ago, the process to determine a small-molecule crystal 
structure consumed several months, and the bottlenecks 
were the data- collection, structure-solution and structure- 
refinements stages [2]. In recent years, X-ray crystallog- 
raphy and solution NMR, the two major techniques 
broadly used for determining the atomic structures of 
biomolecules, have been improved substantially [3]. 
There already have 85212 PDB files (2012.10.09) depos- 
ited in the RCSB Protein Data Bank [4]. For proteins 
and nucleic acids, bond orders can be easily deduced 
but it is not so easy for other types of molecules, such 
as ligands [5]. Because the resolutions of a lot of X-ray 
structures are not high enough, the hydrogen atoms and 
bond type information usually cannot be identified and 
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even the coordinates have large errors, which bring dif- 
ficulty in confirming the bond types of the structures. 
Although we can manually assign the bond types for the 
co-crystalized ligands, the procedure is tedious and 
time-consuming if one wants to deal with a large num- 
ber of structures. 

Many methods have been developed for automatically 
detecting the chemical structures [5-14]. For example, 
Baber et al. developed an algorithm to assign the bond 
orders by comparing the bond length with ideal values, 
and this algorithm uses a confidence value for each bond 
to solve the conflicts between coordination state and 
bond order [6]. Hendlich and co-workers proposed the 
BALI method to automatically assign bonds for protein 
ligand in the Brookhaven Protein Data Bank, and the 
assignment procedure of BALI has several stages, includ- 
ing the recognition of simple functional groups, the 
perception of ring systems, and the optimization of the 
assignment of alternating single and double bonds to 
networks of sp 2 hybridized atoms [9]. Froeyen and 
Herdewijn proposed a procedure to assign double and 
triple bonds in small molecules from simple PDB files; 
the method uses sigma bonds connectivity and atom 
symbols as the input, and then assigns double and triple 
bonds by translating the octet equations into an integer 
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liner program [8]. Bruno et al designed an algorithm to 
assign bonds for the molecules in the CSD (Cambridge 
Structural Database). Bruno s algorithm divides the 
whole molecule into small fragments at first, then calcu- 
lates the frequencies of the fragments in the CSD and 
uses geometry tests to obtain conditional probability, 
and finally apply Bayes' formula to assign structures [7]. 
Zhao's method detects the hybridization states of atoms 
and then uses functional groups and length rules to de- 
termine the bond types [13]. Wang and Case applied a 
recursive algorithm to assign bond types using the atom 
symbols and bond connectivity table as input. In Wangs 
algorithm, the best assignment is recognized when the 
total bond order of each atom satisfying the predefined 
atomic valence parameter [14]. 

All these available methods have their own advan- 
tages, but at the same time, have some disadvantages 
too. For example, Labute used Maximum Weighted 
Matching algorithm to assign bond orders instead of 
using group patterns in order to avoid mis-constructing 
structures that are similar to group patterns [10]. How- 
ever, due to this advantage, the structures with large co- 
ordinate errors usually cannot be fixed because this 
method heavily relies on precise coordinates. Opposite 
to Labute s work, the algorithm developed by Zhao et al. 
assigns the hybridization states for each atom first, and 
then uses a lot of functional groups to assign bond 
orders [13]. Using functional groups can avoid the dis- 
advantage of poor coordinates, but the false detections 
of hybridization and similar structures will impair bond 



perception. Wang and Case s algorithm, which has been 
implemented in the Antechamber module of AMBER 
package, is robust and widely used among the AMBER 
community; however, their method fails if the input 
molecule has open valences. 

In our work, in order to overcome the disadvantages 
mentioned above, we designed a new algorithm for the 
bond perception based on three different sets of rules: 
hard rules, length rules and conjugation rules. More- 
over, in this algorithm, we employed small structural 
patterns to assign the matched groups rather than using 
large ones, which can avoid the errors caused by similar 
structures. In conjugation rules, we divide the bonds 
that remain unfixed into several groups, and the bonds 
that connect to each other are in the same group. And 
then, in order to enhance the reliability of the results we 
fix these groups one by one instead of fixing bonds 
separately. The flow chart of our algorithm is shown in 
Figure 1. Compared with the other algorithms, our algo- 
rithm has the following advantages: (1). this algorithm 
has a better tolerance of coordinate error; (2). it can 
avoid many false perceptions because of the employ- 
ment of fewer patterns; (3). it can select a better struc- 
ture from isomers by using the bond entropy; (4). it can 
assign the bond orders for the structures without 
hydrogen. 

Results and discussion 

We used three data sets to validate our algorithm. The 
first data set collected by Ricketts et al. [15] has 17 
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Figure 1 The flow chart of the algorithm. 
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ligands extracted from the PDB files (PDB entries are 
lcla, lfcb, lfxl, lgox, 2aat, 2dhf, 2gbp, 2trm, 3cpp, 3ptb, 
4dfr, 4xia, 5xia, 7dfr, 8atc, and 8rsa), and it has been ex- 
tensively used for testing the algorithms of bond assign- 
ment. Therefore, this data set was also used by us to 
evaluate our algorithm. 

The performance of our algorithm and the other four 
algorithms for the first data set is summarized in Table 1. 
In all these algorithms, the algorithm developed by Sayle 
et al. correctly fixed all the structures. It should be noted 
that all the structures were used to train the Sayle s algo- 
rithm [12], and therefore it is understandable that the 
Sayles algorithm can fix all the structures in the first 
data set. The Labute s algorithm correctly perceived 14 
structures, the Baber s algorithm 7, and the Meng s algo- 
rithm 5. Our algorithm correctly perceived 16 structures, 
and only failed on 1 structure in 2trm. The ligand in 
2trm is benzamidine, and our method detects a cyclo- 
hexene ring instead of a benzene ring. From the front 
view shown in Figure 2, it can be found that the ring of 
six carbons is just like cyclohexane in its chair form, and 
therefore our algorithm assigns the six C-C bonds to 
single bonds because all these atoms are not on the 
same plane. 

According to the evaluation results for the first data 
set, our algorithm has excellent performance among 
those algorithms. Then, a larger data set with 179 entries 
(hydrogen suppressed) from the PDB was used for the 
further assessment of our algorithm. Labute et al. [10] 
and Zhao et al. [13] used this data set to test their algo- 
rithms. The PDB codes of these structures in the second 
data set are listed in Additional file 1: Table S2. Among 
the 179 structures, 170 were correctly interpreted 
(95.0%) by our algorithm and 9 were incorrectly per- 
ceived. The Labutes algorithm failed on 11 structures 
and the Zhao's algorithm on 9 structures. Moreover, two 
popular molecular simulation packages, including Dis- 
covery Studio (version 2.5) [16] and SYBYL-X (version 
1.3) [17], were used to perceive the molecules in the sec- 
ond data set. According to the prediction results, Dis- 
covery Studio failed on 25 structures and SYBYL even 
failed on 46 structures. That is to say, the performance 
of Discovery Studio and SYBYL is much worse than that 



Table 1 The performance of different algorithms on the 
first data set 



Algorithm 


Success number 


Success rate 


Baber 


7 


41.2% 


Meng 


5 


29.4% 


Sayle 


17 


100% 


Labute 


14 


82.4% 


Our 


16 


94.1% 



of our algorithm. The problematic structures predicted 
by our algorithm are listed in Table 2: 

1. leps and 3dfr: several bonds in leps and 3dfr that 
were supposed to be double are too long, and then 
they were assigned to single due to their lengths. The 
lengths of two C-C bonds in the ring of leps are 
1.62A and 1.57A, respectively, and are larger than 
1.49 A. The length of a C-C bond in 3DFR is 1.54A 
and was set to single according to length rules. 

2. lbzm, lnnb, lrne and 3fx2: for the ligands extracted 
from these four PDB files, several double bonds were 
incorrectly detected because they are not on the 
plane. We calculated the smallest angle between the 
bond and plane, and found that none of them 
satisfied the plane rule, and therefore all of them 
were set to single bonds. Here we may find two 
solutions to improve our algorithm: design and apply 
more structural patterns or modify the criteria used 
to check the plane. 

3. 5tln and 8xia: the terminal bond C-O is supposed to 
be double but the angle of O-C-C deviates greatly 
from 120° (90.56° and 97.63°, respectively), so our 
algorithm set their bond orders to 1 based on our 
rules. We suspect inadequate refinement of the 
ligand geometry because the same functional groups 
in the other entries can be correctly perceived by our 
algorithm. 

4. 5tim: two sulfurs in 5tim are not connected to each 
other in the correct structure. But because of short 
distance, a connection between them was established. 
Moreover, due to the non-existence of hydrogen, and 
the maximum connection number of sulfur is 4, and 
therefore we could not find out the false connection 
by detecting the number of the connected atoms and 
the valence. One way that may solve this problem is 
to calculate the hybridization of sulfur. But because 
no hydrogen is provided, even if the two sulfurs are 
connected, we still cannot find the mistake just based 
on the structure. 

Compared with the results predicted by the Labutes 
algorithm, we find that 2cgr, 3dfr and 5tln were not cor- 
rectly detected by both of the Labutes algorithm and 
our algorithm. Labute ascribed these errors to question- 
able or ambiguous geometry. According to our analysis, 
we think that the wrong perception of structures can be 
explained by the following two reasons: awkward struc- 
tures and imperfect algorithm. For awkward coordinates, 
we may analyze the large prior data to get an amend- 
ment between PDB files and true structures instead of 
using patterns, because if more patterns are used, more 
errors may be introduced. But on the other hand, we still 
need some structural patterns to deal with the 
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Figure 2 The representation of the ligand in 2trm in different orientations. 



coordinate errors when the resolution of a PDB file is 
not high. Furthermore, the judgment of atom connec- 
tion is also an important point to which we need to pay 
more attention. 

Finally, we tested our algorithm on an extensive col- 
lection of 1290 small molecules (with hydrogen) used by 
us to train a prediction model of solubility [18]. For this 
dataset, our algorithm achieves the prediction accuracy 
is 97.8% (1265 structures were correctly perceived). Out 
of the 25 failed structures, most of them (19) were 
found to have unusual bond lengths. For those failed 
cases, two possibilities lead to false perception: (1). in 
most cases distance between two atoms that is supposed 
to be connected is longer than the upper limit defined 
in our algorithm, and in some cases bonds were 
assigned to double by the length rules because of the 
short distance due to bad coordinates; (2). for some spe- 
cial structures, the rules in our algorithm cannot be ap- 
plied correctly, and four structures are shown in 
Figure 3 as examples. As shown in Figure 3, the double 
bonds colored in red were predicted to be single by our 
algorithm. For these four structures the bond types of 
them are just beyond our rules. For example, for struc- 
ture (a) in Figure 3, the length of the falsely predicted 
bond is 1.33 A, suggesting that this bond has chance to 
be double by the length rule. While applying the conju- 
gation rule, we found that cos of this bond is 0.60, which 
is larger than 0.50. Therefore this bond was set to single 
regardless of its bond length; however, it should be a 
double bond because it is in an aromatic ring. Thus, we 
can understand why the incorrect bond lengths in these 
structures cannot be handled by our algorithm because 
they are beyond the conventional values defined in our 
algorithm. We then used Discovery Studio [16], MOE 
[19] and SYBYL [17] to perceive the bond types for 
these 25 molecules that cannot be correctly predicted 
by our algorithm. We found that only one of them could 
be fixed by Discovery Studio and MOE, and none of the 
19 molecules that have unusual bond lengths could be 
fixed. However, it is interesting to find that SYBYL can 
fix seven of them. That is to say, for these 25 molecules 
that cannot be correctly predicted by our algorithm, 



SYBYL performs better than Discovery Studio, MOE 
and even our algorithm. 

Then, these 1290 molecules were minimized using the 
MMFF94 force field implemented in SYBYL [17]. Each 
structure in the sdf file was converted to the PDB for- 
mat. After applying our algorithm, all molecules could 
be fixed correctly, suggesting that when the coordinates 
of the studied small molecules are reasonable the predic- 
tion accuracy of our algorithm can almost achieve 100%. 

Conclusions 

We have presented a rule-based algorithm for automatic 
bond type perception in molecules given 3D atomic coor- 
dinates and element identities. Our algorithm is efficient to 
detect the structures when their geometries are reasonable. 
For many structures with coordinate errors, we can also 
perceive them by applying the small structural patterns. 

However, there still remain some challenges, especially 
for molecules with misconnected and false connected 
bonds. Due to the absence of hydrogen in many struc- 
tures we cannot determine whether bond efficiency is 
satisfied or not. And for the false connected bonds, we 
check the maximum connection number first, and detect 
the valence at last. In our algorithm we prefer to use 
small structural patterns to perceive structures instead 
of using a lot of large ones, and this will reduce the pos- 
sibilities of perceiving structures incorrectly caused by 
similar structures. When fixing the isomers, we use the 
entropy of bond to select the best isomer. 

In the near future we plan to improve our algorithm 
in two directions: 

1. Connection detection: We plan to design and 
employ better rules to detect the bonds which 
should be connected but are ignored by our 
algorithm and to reduce the redundant bonds that 
are over-connected. 

2. Define the entropy of bond by using more subtle 
scheme: To achieve this, a training process based on 
a lot of data is necessary. Some parameters and 
formula should be changed according to the training 
results. 
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Table 2 The nine failed structures predicted by our algorithm (Continued) 



8xia 




Methods 

Identification of connected atoms 

First, connection between atoms should be judged 
according to Equation 1. 

0.8 < dtj < n + rj + 0.4 (1) 

where r t and r y are the covalent radii of atoms i and ; 
(the detailed data is listed in Additional file 1: Table SI 
[20]), and dy is the distance between them. 

The right side of Equation 1 is derived by Meng and 
Lewis [11], and the left side was derived from our struc- 
tural analysis based on 126,292 small molecules 
extracted from the ZINC database [21], and these mole- 
cules belong to the "all-purchasable" subset in ZINC. 
When the cutoff of 0.8 was not used and only the right 
side of Equation 1 was used to detect the connection of 
small molecules, 78 molecules had false connections. 
After adding the cutoff of 0.8, the number of the mole- 
cules with false predictions was reduced to 58. There- 
fore, in order to achieve more accurate predictions, the 
cutoff of 0.8 was employed. When bond by was estab- 
lished, its initial bond order Oy was set to 1. 

This algorithm only judges the bond type between H, 
C, N, O, F, P, S, CI, Br, and I, and the other kinds of 
bond order were simply set to 1. The bonds that have 
already been fixed would not be detected in the follow- 
ing steps, except that it is in a group and the group has 
many isomers. 

Over-connected correction 

Due to the intrinsic property of each element, it should 
have the maximum connections. Here, we check 
whether C(4), N(4), P(4) or S(4) is over-connected (the 
number in parenthesis represents the maximum number 
of connections); if so, the longest bond is removed and 
the atom is check again until the number of the con- 
nections is no longer bigger than its maximum con- 
nection number. Here we only consider the above four 
kinds of elements, and the connections for the other 
elements are handled at the bottom of this algorithm. 

Hard rules 

We employed some basic chemical rules to assign the 
bond types. 



(1) We use A t to denote atom z, and set its bonded 
number, Q, be the number of the atoms that are 
connected with this atom. If the number equals to 

1.1 If the atom is hydrogen or halogen, Og is set to 1. 

1.2 If the atom is sulfur and it connects to 
phosphorus, or the atom is nitrogen and it 
connects to sulfur, Og is set to 2. 

1.3 If else, leave the bond unfixed and leave this rule. 

(2) when Q is 2 1 we apply the following rules: 

2.1 If the atom is oxygen or sulfur, Og is set to 1 as 
the valence of oxygen and sulfur is 2. 

2.2 Cj = max {Cp C/J, Q c = max {Cp C/J, if Cj = 1, skip 
step 2. 

2.3 Calculate the angle of two connected bonds (by and 
bit) using the following two equations: 

0=co S - i( f -*/H Xi _-*;) (2) 

9 = 0-180° fn (3) 

where and %k are the coordinates of atoms 
z, /, and k. 

If 0 is between 175° and 185°, these two 
bonds are considered to be linearly 
connected, and there are three kinds of 
possible schemes: when C ; equals to 2, 
suggesting that the bonded number of this 
atom is 2 and if the bond angle between its 
two bonds is between 175° and 185°, three 
adjacent bonds are therefore just in line. 
Here, we assume no continuous triple bond 
systems, and considering by is in the middle 
of them, O t j is set to 3 and then O ik is set to 
1. If A k is C or N, Oy is set to 1 and then 
0;/ c is set to 2. In the other cases, both of Oy 
and are set to 2. All models are applied if 
the previous ones do not work. 

(3) when Q equals to 3 

3.1 When the atom is N or P, and if it connects to 
oxygen (A 7 ) and O - 1, this group is fixed using 
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an acid model. If not, and all bond orders are set 
to be single. 

3.2 When the atom is S, CI, Br or I, and if it 
connects to oxygen (Aj) and C ; = 1, this group is 
fixed as an acid. If not, leave it. 

3.3 When the atom is C, this atom and its 
connected atoms are put into one group, and the 
cos of this group (Figure 4) is calculated using 
the following two equations: 



fa - 1 - %i) x (xi + 1 - Xi) .(xi + 2 - Xi) . 

COS0; = — ; r ~ ; — -I 

\{xi - 1 - xi) x [xi + 1 - Xi)\\Xi + 2 - Xi\ 
= i mod k 

(4) 

cos0 = max{ cos0 2 -|/ = 1 . . .k} (5) 

If the cos > 0.25, all the bonds connected by this 
atom are set to be single; otherwise, we will apply 
four models shown in Figure 5. 
(4) When Q equals to 4 

4.1 If the atom is C or N, because the number of the 
maximum connections is 4, and when they are 
connected with 4 atoms, all the bonds should be 
single. 

4.2 Else, if the atom is P, S, CI or Br, I, it must be an 
acid group, then fix it. 



Length rules 

For each bond By connected to A b the bond order is 
determined by the bond length. We use Z t to denote 
the atomic number of A b Z x = min {Z b Zj} and Z y = max 

{Z„Z ; }. 

1.1fA x is hydrogen, considering its valence, the bond 
order is set to 1; and if A y is halogen, the bond order 
is also set to 1 because the acid model has already 
been considered in the hard rules. 

2. The length of By is defined as Ly. If this bond can be 
found in Table 3 (bond lengths in this table were 
obtained by a parameterization process to achieve 
the best prediction accuracy), the bond order is 
assigned by the following steps. It should be noted 
that the minor changes of the bond lengths listed in 
Table 3 do not change the prediction accuracy 
obviously, and the low sensitivity of bond lengths is 
also an advantage of our algorithm. 

2.1 Ly < L 2 : the bond length is smaller than the length 
of double bond, and the bond order is set to 2. 

2.2 Ly <L 2 + 0.05: the bond type is set to 5, suggesting 
it is not sure, and it has better chance to be 2 

2.3 Ly <Li~ 0.04: the bond type is set to 4, suggesting 
that it is not sure. 

2.4 Ly < L]_: the bond type is set to 6, suggesting that 
it has better chance to be 1. 
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Figure 4 The definition of (p. S is the plane that line al and line 
a2 forms, line F is the normal line of plane S, and cp is the angle 
between line F and line a3. 



2.5 For the rest unsigned bonds, their bond orders 
are set to 1. 

3. If the bond cannot be found in Table 3, the bond 
type is set to 4. 

Each step is applied after the application of the previ- 
ous steps. 

Conjugation rules 

After applying the previous steps, part of the bonds have 
been fixed, but there are still many unassigned bonds 
(bond type equals to 4, 5 or 6). Most of them have im- 
proper bonds length and have the biggest chance to be 
aromatic. At this step, we design the conjugation rules 
to determine these unassigned bonds. 

1. For bond B i>i + 1 , atoms that connect to this bond, 
and A b A { + 1 are put into one group G(k) 

If cos > 0.50, suggesting that they are totally not on 
the same plane, the bond order is then set to 1. If O it 
i + !=6, Q > 2, Q + i > 2, Q + 1>2 and cos > 0.2, they 
are likely not on the same plane, and the bond type 
of 6 indicates that the bond order has the chance to 
be single (Oi,i + 1=1). 

2. The remaining unfixed connected bonds are divided 
into groups, and therefore we may have several 
groups for a molecule. 

2.1 Group size equals to 1: We will check if this bond 
is connected with any double bond. If the bond is 
not connected with any double bond, and this 



bond and its neighbor bonds are on plane (when 
cos < 0.2, see above), the bond order is set to 2. 
There is one exception: if the bond is at the edge 
of the structure and the number of the bonds to 
which it are connected equals to 1, they are 
always on the same plane. In this condition, we 
will check if the bond type is 5 and the angle 
between these two bonds differs from 120°in 3 
degrees. If not, the bond order is set to 1. When 
they are not on the same plane and the bond type 
is 5, we denote BAq as the bond angle between 
bond i and bond / 



BA = mm{\BAij - 120|,; = 1 . . .n) 

where ^3 ^ 4 equals to the number of the 
bonds that are connected to bond L 



(6) 



If A 2 is 

smaller than 3, the bond order is then set to 2; 
and otherwise, the bond order is set to 1. 

2.2 Group size equals to 2: we will get a return value 
after judging the previous bond orders that is 
larger than 3 and their neighbors of the two 
bonds (Table 4). 

If return = 0, 1, 2: 0 12 =1 + return/2, 0 23 = 1 + 
return%2 

If return = 4: 0 12 = 1, assign the other bonds 
using rule 2.1 

If return =5: 0 2 3 = 1, assign the other bonds 
using rule 2.1 

2.3 Fix isomers: For each bond, if one of the atoms is 
nitrogen and this nitrogen is only connected with 
one bond, the order is set to 1. For the others, a 
new rule is applied. 

Rule: for one bond, if it connects to a fixed double 
bond, its order is set to 1; otherwise, we will check if 
one of its atoms just has three bonds and two of the 
bonds are single. If so, we will have the following two 
possibilities: (1). the bond type is 5, and the bond order 
is then set to 2; (2). the minimal angle between this bond 
and the connected bonds is larger or smaller than 120°in 
3 degrees, the bond order is set to 1. If the above possi- 
bilities are not satisfied, the bond remains unfixed. 

If there still remain bonds to be fixed after applying 
the above rule, for those unfixed ones, the order of one 
bond is set to 2 each time, and then the rule is applied 
again; if those bonds cannot be fixed, we will set the next 
bond and try the rule again. Once it has been done, we 
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Table 3 The reference bond lengths (A) 

Bond L1 a L2 b 



c-c 


1.49 


1.33 


C-N 


1.43 


1.29 


0-0 


1.41 


1.28 


c-s 


1.80 


1.72 


N-N 


1.38 


1.29 


N-0 


1.39 


1.24 



a L1 represents the length of single bond; b L2 represents the length of double 
bond. 



can get one isomer. Then, the bond order for the next 
unfixed bond is set to 2, and we will check if we can get 
another suitable structure. Thus, it is quite possible that 
we can get several isomers. 

To determine the best structure from multiple iso- 
mers, we propose a new parameter: Entropy of bonds. 
For each isomer, the entropy of bonds is the number of 
bonds that satisfy the following criteria for each B^: 

1. 0^ = 1 

2. it connected to at least two double bonds 

3. The double bonds it connected at least connected to 
another bond. 



Then the structure with the largest entropy is chosen 
as the best choice. If some isomers have the same en- 
tropy values, the structure that we first meet is chosen. 
Our results show that for most cases the representative 
structures can be chosen accurately using the easy 
definition. 

Fixing disordered bonds 

We already checked the atoms to make sure they are 
not over-connected, but after the application of the rules 
above, many single bonds have been changed to double 
or triple bonds, and then some atoms are over-con- 
nected. Therefore, we use the maximum valence to 
judge if this atom is over-connected. For atom A b the 
number of the connected bonds is calculated based on 
the following equation: 

n 

iCon = O ik (7) 

k=0 

If iCon > valence, the longest single bond is deleted, 
and the structure is checked again until iCon is no 
longer larger than the maximum valence. Here we just 
handle single bond, because for the bond with the bond 
order larger than 1, its bond length is reliable if this 
bond can be judged by the rules applied above. 



Table 4 Return numbers based on small structural patterns 0 



Return number 



AH doesn't have a double bond a A A A A A 



A 22 A<|. A 3 A 2 A 22 



,3 U=0^£ 



Bond types are 5 and 6 



.Aw ^A 3 



A\ has more than 1 bonds 

Ak ^A 3 



A3 has more than 1 bonds \. ^ v ^ \ / 

012 = 4 A2 A 2 



023 > 4 x A-) /A 3 A^ A 3 



A2 has three bonds 



012 > 4 A 2 
023 = 4 A^ A 3 A^ A 3 



A 2 U=3 "A$ 

A 1\ ^A 3 Ai. A 3 



"A 



Others others 

a The dashed bond means this bond has already been fixed. The bold atom/bond means atoms/bonds around it are on the same plane. Conditions are taken into 
account from up to down. The plain bonds represents bonds to be fixed. Some fixed bonds are not given in the structures. 
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Additional file 1: Table SI. The radius of different elements; Table S2. 
The PDB codes for the structures used in the second data set. 
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